In the following exercises we will wrangle some data on population from Gapminder.
library(readr)
gap_pop <- read_csv("../data/gapminder/population_total.csv")
As you may have noticed, the name of the first column in the dataset does not match its content.
library(dplyr)
gap_pop <- gap_pop %>%
rename(country = "Total population")
gap_pop
## # A tibble: 275 x 82
## country `1800` `1810` `1820` `1830` `1840` `1850` `1860` `1870`
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Abkhaz~ NA NA NA NA NA NA NA NA
## 2 Afghan~ 3280000 3280000 3323519 3448982 3625022 3810047 3973968 4169690
## 3 Akroti~ NA NA NA NA NA NA NA NA
## 4 Albania 410445 423591 438671 457234 478227 506889 552800 610036
## 5 Algeria 2503218 2595056 2713079 2880355 3082721 3299305 3536468 3811028
## 6 Americ~ 8170 8156 8142 8128 8114 7958 7564 7057
## 7 Andorra 2654 2654 2700 2835 3026 3230 3436 3654
## 8 Angola 1567028 1567028 1597530 1686390 1813100 1949329 2110747 2285417
## 9 Anguil~ 2025 2025 2064 2177 2338 2511 2693 2888
## 10 Antigu~ 37000 37000 37000 37000 37000 37000 36532 35546
## # ... with 265 more rows, and 73 more variables: `1880` <dbl>,
## # `1890` <dbl>, `1900` <dbl>, `1910` <dbl>, `1920` <dbl>, `1930` <dbl>,
## # `1940` <dbl>, `1950` <dbl>, `1951` <dbl>, `1952` <dbl>, `1953` <dbl>,
## # `1954` <dbl>, `1955` <dbl>, `1956` <dbl>, `1957` <dbl>, `1958` <dbl>,
## # `1959` <dbl>, `1960` <dbl>, `1961` <dbl>, `1962` <dbl>, `1963` <dbl>,
## # `1964` <dbl>, `1965` <dbl>, `1966` <dbl>, `1967` <dbl>, `1968` <dbl>,
## # `1969` <dbl>, `1970` <dbl>, `1971` <dbl>, `1972` <dbl>, `1973` <dbl>,
## # `1974` <dbl>, `1975` <dbl>, `1976` <dbl>, `1977` <dbl>, `1978` <dbl>,
## # `1979` <dbl>, `1980` <dbl>, `1981` <dbl>, `1982` <dbl>, `1983` <dbl>,
## # `1984` <dbl>, `1985` <dbl>, `1986` <dbl>, `1987` <dbl>, `1988` <dbl>,
## # `1989` <dbl>, `1990` <dbl>, `1991` <dbl>, `1992` <dbl>, `1993` <dbl>,
## # `1994` <dbl>, `1995` <dbl>, `1996` <dbl>, `1997` <dbl>, `1998` <dbl>,
## # `1999` <dbl>, `2000` <dbl>, `2001` <dbl>, `2002` <dbl>, `2003` <dbl>,
## # `2004` <dbl>, `2005` <dbl>, `2006` <dbl>, `2007` <dbl>, `2008` <dbl>,
## # `2009` <dbl>, `2010` <dbl>, `2011` <dbl>, `2012` <dbl>, `2013` <dbl>,
## # `2014` <dbl>, `2015` <dbl>
As you have probably noticed, the data are currently in wide format.
select() function.
gap_pop %>%
select(country, "1990":"1999")
## # A tibble: 275 x 11
## country `1990` `1991` `1992` `1993` `1994` `1995` `1996` `1997`
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Abkhaz~ NA NA NA NA NA NA NA NA
## 2 Afghan~ 1.21e7 1.28e7 1.37e7 1.48e7 1.59e7 1.68e7 1.75e7 1.80e7
## 3 Akroti~ 1.41e4 1.42e4 1.43e4 1.44e4 1.45e4 1.46e4 1.47e4 1.48e4
## 4 Albania 3.28e6 3.28e6 3.24e6 3.19e6 3.14e6 3.11e6 3.09e6 3.09e6
## 5 Algeria 2.59e7 2.66e7 2.72e7 2.78e7 2.84e7 2.89e7 2.94e7 2.99e7
## 6 Americ~ 4.70e4 4.84e4 4.96e4 5.07e4 5.18e4 5.29e4 5.39e4 5.49e4
## 7 Andorra 5.45e4 5.67e4 5.89e4 6.10e4 6.27e4 6.39e4 6.43e4 6.41e4
## 8 Angola 1.11e7 1.15e7 1.18e7 1.22e7 1.26e7 1.30e7 1.34e7 1.38e7
## 9 Anguil~ 8.33e3 8.67e3 8.98e3 9.27e3 9.54e3 9.81e3 1.01e4 1.03e4
## 10 Antigu~ 6.19e4 6.24e4 6.34e4 6.49e4 6.66e4 6.83e4 7.02e4 7.22e4
## # ... with 265 more rows, and 2 more variables: `1998` <dbl>, `1999` <dbl>
For the next data wrangling steps, it is more convenient to have the data in long format.
library(tidyr)
gap_pop <- gap_pop %>%
gather(-country, key = "year", value = "pop")
For some analyses, it might help (or even be necessary) to only work with a specific subset of observations.
Create two new dataframes that include different subets of the gap_pop data:
Data for all countries for the 19th century (name this one gap_pop_19thcen),
gap_pop_19thcen <- gap_pop %>%
filter(year < 1900)
gap_pop_19thcen
## # A tibble: 2,750 x 3
## country year pop
## <chr> <chr> <dbl>
## 1 Abkhazia 1800 NA
## 2 Afghanistan 1800 3280000
## 3 Akrotiri and Dhekelia 1800 NA
## 4 Albania 1800 410445
## 5 Algeria 1800 2503218
## 6 American Samoa 1800 8170
## 7 Andorra 1800 2654
## 8 Angola 1800 1567028
## 9 Anguilla 1800 2025
## 10 Antigua and Barbuda 1800 37000
## # ... with 2,740 more rows
gap_pop_ger_21stcen <- gap_pop %>%
filter(country == "Germany", year > 1999)
gap_pop_ger_21stcen
## # A tibble: 16 x 3
## country year pop
## <chr> <chr> <dbl>
## 1 Germany 2000 81895925
## 2 Germany 2001 81809438
## 3 Germany 2002 81699829
## 4 Germany 2003 81569481
## 5 Germany 2004 81417791
## 6 Germany 2005 81246801
## 7 Germany 2006 81055904
## 8 Germany 2007 80854515
## 9 Germany 2008 80665906
## 10 Germany 2009 80519685
## 11 Germany 2010 80435307
## 12 Germany 2011 80424665
## 13 Germany 2012 80477952
## 14 Germany 2013 80565861
## 15 Germany 2014 80646262
## 16 Germany 2015 80688545
For some analyses as well as for plotting the data, it makes sense to define the country variable as a factor.
mutate() function for this.
gap_pop <- gap_pop %>%
mutate(country = as.factor(country),
year = as.integer(year),
pop = as.integer(pop))
gap_pop
## # A tibble: 22,275 x 3
## country year pop
## <fct> <int> <int>
## 1 Abkhazia 1800 NA
## 2 Afghanistan 1800 3280000
## 3 Akrotiri and Dhekelia 1800 NA
## 4 Albania 1800 410445
## 5 Algeria 1800 2503218
## 6 American Samoa 1800 8170
## 7 Andorra 1800 2654
## 8 Angola 1800 1567028
## 9 Anguilla 1800 2025
## 10 Antigua and Barbuda 1800 37000
## # ... with 22,265 more rows
Let’s imagine that we want to combine the population data that we have with some other country-level data (we will discuss joining datasets in session B2 on relational data tomorrow). If the data come from different sources, it is quite likely that the names of the countries differ between them. If we want to join the datasets, we need to harmonize the country names.
gap_pop %>%
mutate(country = recode(country,
"Cook Is" = "Cook Island",
"Kyrgyz Republic" = "Kyrgyzstan",
"Micronesia, Fed. Sts." = "Micronesia"))
## # A tibble: 22,275 x 3
## country year pop
## <fct> <int> <int>
## 1 Abkhazia 1800 NA
## 2 Afghanistan 1800 3280000
## 3 Akrotiri and Dhekelia 1800 NA
## 4 Albania 1800 410445
## 5 Algeria 1800 2503218
## 6 American Samoa 1800 8170
## 7 Andorra 1800 2654
## 8 Angola 1800 1567028
## 9 Anguilla 1800 2025
## 10 Antigua and Barbuda 1800 37000
## # ... with 22,265 more rows
Of course, instead of changing the variable types at this point, we could have also specified the column types when reading in the data (see the session and exercises on importing data).
In the next step, we want to create some new variables based on ones that already exist in the dataset.
Create the variable Population in thousands in the gap_pop dataset (name the new variable pop_in_thousands).
lag() function.
gap_pop %>%
mutate(pop_in_thousands = pop/1000)
## # A tibble: 22,275 x 4
## country year pop pop_in_thousands
## <fct> <int> <int> <dbl>
## 1 Abkhazia 1800 NA NA
## 2 Afghanistan 1800 3280000 3280
## 3 Akrotiri and Dhekelia 1800 NA NA
## 4 Albania 1800 410445 410.
## 5 Algeria 1800 2503218 2503.
## 6 American Samoa 1800 8170 8.17
## 7 Andorra 1800 2654 2.65
## 8 Angola 1800 1567028 1567.
## 9 Anguilla 1800 2025 2.02
## 10 Antigua and Barbuda 1800 37000 37
## # ... with 22,265 more rows
gap_pop_ger_21stcen %>%
mutate(pop_perc_change = (pop - lag(pop))/pop*100)
## # A tibble: 16 x 4
## country year pop pop_perc_change
## <chr> <chr> <dbl> <dbl>
## 1 Germany 2000 81895925 NA
## 2 Germany 2001 81809438 -0.106
## 3 Germany 2002 81699829 -0.134
## 4 Germany 2003 81569481 -0.160
## 5 Germany 2004 81417791 -0.186
## 6 Germany 2005 81246801 -0.210
## 7 Germany 2006 81055904 -0.236
## 8 Germany 2007 80854515 -0.249
## 9 Germany 2008 80665906 -0.234
## 10 Germany 2009 80519685 -0.182
## 11 Germany 2010 80435307 -0.105
## 12 Germany 2011 80424665 -0.0132
## 13 Germany 2012 80477952 0.0662
## 14 Germany 2013 80565861 0.109
## 15 Germany 2014 80646262 0.0997
## 16 Germany 2015 80688545 0.0524
Finally, let’s combine two basic data wrangling steps to answer an actual question with the data.
filter() and arrange() the gap_pop dataset.
gap_pop %>%
filter(year == 2015) %>%
arrange(-pop) %>%
head(n = 5)
## # A tibble: 5 x 3
## country year pop
## <fct> <int> <int>
## 1 China 2015 1376048943
## 2 India 2015 1311050527
## 3 United States 2015 321773631
## 4 Indonesia 2015 257563815
## 5 Brazil 2015 207847528
gap_pop %>%
filter(year == 2015) %>%
arrange(pop) %>%
head(n = 5)
## # A tibble: 5 x 3
## country year pop
## <fct> <int> <int>
## 1 Holy See 2015 800
## 2 Tokelau 2015 1250
## 3 Niue 2015 1610
## 4 Falkland Is (Malvinas) 2015 2903
## 5 St. Helena 2015 3961